Roadmap to the study of gene and protein phylogeny and evolution—A practical guide

Developments in sequencing technologies and the sequencing of an ever-increasing number of genomes have revolutionised studies of biodiversity and organismal evolution. This accumulation of data has been paralleled by the creation of numerous public biological databases through which the scientific community can mine the sequences and annotations of genomes, transcriptomes, and proteomes of multiple species. However, finding the appropriate databases and bioinformatic tools for a given inquiry or aim can be challenging. Here, we present a compilation of DNA and protein databases, as well as bioinformatic tools for phylogenetic reconstruction and a wide range of studies on molecular evolution. We provide a protocol for information extraction from biological databases and simple phylogenetic reconstruction using probabilistic and distance methods, facilitating the study of biodiversity and evolution at the molecular level for the broad scientific community.

We understand the reviewer's concern here. There is a delicate balance when condensing a complex field into a practical guide that is aimed at beginners as well as somewhat more advanced users. We have tried to go into more depth while clarifying how to choose the paths forward. More specifically, we have added several tools and bioinformatic methods for sequence alignment and trimming, model selection and phylogenetic inference, as well as comments on their specificities, including models of molecular evolution and bootstrapping methods. We edited the protocol so that it is more comprehensive, clearer and more user-friendly for non-bioinformatic users (and also stated this in the introduction). We described the specificities of databases and tools in more detail in Tables 1-10 and their corresponding paragraphs, to help protocol users choose the appropriate method.
We explained the different methods for sequence alignment in more detail, and added several programs that can be used, with their specificities and comments on how to choose between them, mostly based on the size of the dataset. Specifically, edits are made on:

Department of Laboratory Medicine
Translational Cancer Research

'[...] and MUSCLE [36] are included in MEGA [44]. They have web interfaces, as do MAFFT [40], Kalign [39], and PRANK [37,55]. PROBCONS [42], T-COFFEE [41] and MAFFT [40] are described to have particularly high accuracy but also high calculation times [45]. They should be restricted to small and intermediate datasets. CLUSTAL Omega [35] and Kalign [39] are particularly fast, but less accurate [46]. They can be used for datasets of up to 4000 and 2000 sequences, respectively [45,46]. The performance of MUSCLE is intermediate [46]. PRANK is particularly accurate for large sets and closely related sequences. Bali-Phy [47] performs a Bayesian coestimation of alignment, phylogeny, and other parameters and is also argued to be very reliable. PASTA [48] and UPP [49], which uses a machine learning technique, are designed for very large datasets. MAFFT offers a wide range of methods, which can be accuracy-oriented, such as L-INS-i, G-INS-i and E-INS-i; or speed-oriented, such as FFT-NS-2, which can be used for up to 30 000 sequences.'

We also added information on the different phylogenetic methods. They are now in separate paragraphs. We added several programs that can be used for phylogenetic inference, with some of their specificities and comments on how to choose between them, mostly based on the size of the datasets, but also on their diverse options (type of data, models implemented, branch support test). Specifically, edits are made on: (Table 7).
PhyloBayes [113,114] is a Bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model, well adapted for large datasets and phylogenomics. Bali-Phy [47] can also be used for phylogenetic analysis using Bayesian inference.'

To make the protocol more comprehensible and easier to use for beginners, we also added information on bioinformatic tools and how to use them, in the text and in the tables. More specifically, we added details on the tools used by the workflows.
Row 399-406: 'The Cyberinfrastructure for phylogenetic research (CIPRES science gateway) [129] is a public resource for phylogenetic analysis that includes most tools and software for sequence alignment, model selection, and phylogenetic inference, including BEAST, FastTree, GARLI, IQ-TREE, jModelTest, MAFFT, MrBayes, PAUP, PhyloBayes and RAxML. Other packages include NGPhylogeny [130], a web service for phylogenetic analysis from sequence alignment to tree inference, and Phylemon [131], a suite of web tools for phylogenetics, phylogenomics, molecular evolution studies and hypothesis testing.'

Row 507-513: 'Structure alignments can be realized to compare protein functions and evolution, and the mean distance in Å between homologous residues can be calculated. I-TASSER [159] and HHpred of the HH-suite software [160] can predict three-dimensional structure for protein sequences using homology information. FoRSA [161] uses a structural alphabet known as Protein Blocks to identify a protein fold from its amino acid sequence, or to identify a protein sequence in the proteome of a species from a crystal structure by calculating a likelihood score.'

References were added for each of these tools.
Reviewer 2: In general, I think there is merit in publishing a manuscript like the one submitted. However, I also think you could make it more comprehensive and more accurate, especially regarding the section on the protocol and phylogenetic analysis. Regarding the protocol, I think readers would need to have a more detailed decision tree that offers them alternative paths, depending on what their objectives and data are. You need to incorporate some measures of quality control at different steps in the protocol, and you need feedback loops that are followed in case the result of a quality control is that the preceding analysis did not yield the expected result (otherwise you will perpetuate errors made at the early steps). In this regard, your protocol is a little like that in Ciccarelli et al. (2006; Science 311, 1283-1287).

We are grateful that the reviewer also sees the merit in this practical guide to the study of gene and protein phylogeny and evolution for those willing to get into the field. This first comment is certainly also valid. Therefore, we have edited the text so that the protocol now includes an extra step to assess phylogenetic assumptions after alignment trimming, as well as several additional tools. We also added feedback loops to the protocol (Figure 1).
Row 88-91: 'Feedback loops illustrate the necessity to control the quality of the alignment, to assess phylogenetic assumptions and to test the robustness of the tree, and to go back to previous steps to redo the analysis if necessary.'

We also added several tools proposed by Reviewer 2. This makes the manuscript more comprehensive, especially for sequence alignment and trimming, model selection and phylogenetic analysis. We also edited the text to clarify the specificities of numerous tools and methods, so that users can select the best-fitting method depending on their data, their objectives and the size of their datasets. More specifically, we have edited the following paragraphs:

Row 248-256: 'Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or subject to saturation. These positions should be manually or automatically excluded prior to the phylogenetic analysis.
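As an illustration of what automatic exclusion of positions can look like, the following is a minimal Python sketch that drops alignment columns in which the fraction of gaps exceeds a threshold. The function name `trim_gappy_columns` and the default threshold of 0.5 are assumptions made for this example only; they are not taken from the manuscript, and dedicated trimming programs apply more elaborate criteria than a simple gap fraction.

```python
from typing import List


def trim_gappy_columns(alignment: List[str], max_gap_fraction: float = 0.5) -> List[str]:
    """Return the alignment without columns whose gap fraction exceeds the threshold.

    `alignment` is a list of equal-length aligned sequences using '-' for gaps.
    The threshold is a hypothetical default chosen for illustration.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(alignment)
    # Keep the indices of columns that are not too gap-rich.
    kept_columns = [
        col
        for col in range(length)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    # Rebuild every sequence from the retained columns only.
    return ["".join(seq[col] for col in kept_columns) for seq in alignment]


# Toy alignment: the fourth column is a gap in two of three sequences.
toy_alignment = ["ATG-CA", "ATGTCA", "A-G-CA"]
trimmed = trim_gappy_columns(toy_alignment, max_gap_fraction=0.5)
```

In this toy case only the fourth column is removed, because two of its three characters are gaps; the second column, with a single gap, is retained.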

We also added more steps to the protocol, including a paragraph on the validation of phylogenetic assumptions, based on a new tool that has been implemented in IQ-TREE, and another paragraph on bootstrapping methods.
Row 261-269: 'Most phylogenetic methods rely on simplifying assumptions stating for example that all sites in the alignment evolved under the same tree, that mutation rates have remained constant, and that substitutions are reversible. Once the alignment is performed and the sites selected for phylogenetic inference, a recent phylogenetic protocol recommends assessing those phylogenetic assumptions when possible [58]. If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses can be biased with most common phylogenetic programs [59]. Several statistical methods have been developed. Recently, tests for all these assumptions have been included in IQ-TREE [60,61]. It is also possible to use the R package MOTMOT [62].'

I welcome that you cite a lot of databases, but could you move their citations from where they are to right after the name? Sometimes, a citation is at the end of the sentence and might be mistaken for a citation to something else (e.g., other databases or multiple sequence alignment).
Yes, thanks for pointing this out. We have now edited the text throughout the manuscript so that citations consistently appear right after the name of every database and software tool, each time they appear in a paragraph.
Regarding the phylogenetic analysis (L194-L364), I compliment you on a valiant attempt to cover a large and complex field. However, I do not think you succeeded because I found gaping holes: 1. You seem unaware of previous phylogenetic protocols, one of which appeared recently in NAR Genomics & Bioinformatics (2, lqaa041).
Thanks for pointing this paper out to us. We have now edited Figure 1 and the text according to this recently published protocol, and cite this work. We have edited the text on:

Row 264-265: 'a recent phylogenetic protocol recommends assessing those phylogenetic assumptions when possible [58]'

Row 370-373: 'For model-based methods, a recent phylogenetic protocol recommends to test the goodness of fit between tree, model and data using a parametric bootstrap [58,128]. Bayesian inference method calculates posterior probabilities, which measure branch support instead of bootstrap values.'

2. Because the manuscript does not present something novel but summarizes bioinformatics tools and resources, chiefly databases, you need to be comprehensive. Unfortunately, you were not comprehensive. This applies to both multiple sequence alignment methods and phylogenetic methods.
We have edited the text to highlight that the main aim of this work is to provide a roadmap for the study of gene and protein evolution for the broad scientific community and for scientists who are new to the field. Our intention is indeed to be comprehensive, but also to simplify the decision-making process for readers choosing the most appropriate tools for their scientific endeavor. Therefore, we believe it is important to strike a balance between being exhaustive and being practical enough to include beginners. This paper does not aim to present all tools that exist for MSA or phylogenetic analysis, but rather a set of tools that are diverse, widely used in the scientific community and user-friendly. Especially for beginners in the field, it is easier to navigate a list of software that is actively maintained by the scientific community, and for which support (in the form of tutorials, documentation, publications, online support groups) is easier to obtain. Still, we added several tools and methods for sequence alignment, trimming, model selection and phylogenetic inference (see responses below). We have edited text: Row 5

Thanks for also pointing out these methods. Related to our response to the previous point, we recognize that several other tools have been developed. We have therefore added a majority of these methods in the text and in Table 4.

However, several of the tools mentioned by reviewer #2 are not maintained, or have not been updated in several years. Sometimes they are not accessible (links in the publications are broken or no longer exist, so one needs to contact the authors to get the source code). We recognize the value of these tools and their contribution to the field, but we believe including them in our manuscript may only create confusion. For those reasons, we have not included Probalign, PicXAA, FSA, GramAlign, and MSAProbs.
4. You mention one method for trimming sites from multiple sequence alignment (GBlocks; in Table 4). There is a suite of other and more suitable methods available (see NAR Genomics & Bioinformatics 2, lqaa024; see also citations 13-21 in that paper).
Yes, we thank the reviewer for this thought. We added a full paragraph on alignment trimming and mention several tools from these publications.
Row 248-256: 'Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or subject to saturation. These positions should be manually or automatically excluded prior to the phylogenetic analysis.

5. You mention three model-selection methods but overlooked other more flexible methods (Nature Methods 14, 587-589; Mol. Biol. Evol. 29, 1695-1701; Syst. Biol. 63, 726-742; Mol. Biol. Evol. 34, 772-773; Syst. Biol. 69, 249-264).
Yes, we thank the reviewer also for pointing this out. We included several new tools in the text and in

6. Your understanding of the relationship between log-likelihood and the AIC and BIC is wrong (L232-L233), suggesting confusion. You should read Briefings in Bioinformatics (21, 533-565).
We have now edited this sentence to make it more accurate and clear.  , 1911-1912).
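For reference, the standard textbook definitions of the two criteria are given below. They are stated as general background and are not the wording of the revised sentence in the manuscript. Here $\hat{L}$ is the maximized likelihood of the model, $k$ the number of free parameters and $n$ the sample size; taking $n$ as the number of alignment sites is a common convention in phylogenetics and is an assumption of this note.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}
\end{align}
```

In both cases the log-likelihood enters with a negative sign, so a higher log-likelihood lowers the score and the model with the lowest score is preferred. The two criteria differ only in the penalty term, $2k$ versus $k\ln n$.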
We acknowledge that the text can be much clearer. Therefore, we have added a full paragraph with information, details and references on the bootstrapping method and other tests of tree robustness. See edits on:

Row 355-373: 'Once the phylogenetic tree is obtained, it is recommended to estimate the robustness of the nodes. Most programs of phylogenetic analysis use the non-parametric bootstrapping method [121]. Bootstrapping is an estimate of error used to assess the repeatability of the clade and how consistently the data support the nodes [121,122]. The characters (e.g., nucleotides or amino acids) are randomly resampled with replacement and a new phylogeny is calculated for each replicate. A bootstrap value is calculated for every node, indicating the proportion of replicate phylogenies that recovered the node from the initial tree. A bootstrap value of 100% means that the node is supported by all informative characters, while low values mean that only few characters support the node. A bootstrap value above 95% is usually considered very good and a bootstrap value below 75% is generally considered a poor support for the clade. 1000 replicates are often used in phylogenetic analysis. Since bootstrapping can be time consuming, fast approximation methods for phylogenetic bootstrap have been proposed and are implemented in programs such as RAxML or IQ-TREE [123][124][125].

Yes, the reviewer is correct that not all programs are presented. This is, as mentioned above, a deliberate choice to balance the overview with a guide that is also accessible to novices in the field. For phylogenetic analysis tools we applied the same criteria as for MSA tools (we did not include tools that are not maintained or are difficult to access). Therefore, we added most of these tools to be more comprehensive, with comments on their specificities. This particularly concerns ML methods and Bayesian inference (see also responses to comments 2-4, and 9).
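The resampling step of the non-parametric bootstrap quoted above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the helper names (`bootstrap_alignment`, `closest_pair`, `bootstrap_support`) are hypothetical, and the "tree" is reduced to asking which two sequences are closest by Hamming distance, a stand-in for a real tree inference program. It is not how RAxML or IQ-TREE compute support.

```python
import random
from typing import Dict, FrozenSet, Iterable


def bootstrap_alignment(aln: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Build one pseudo-replicate by resampling alignment columns with replacement."""
    length = len(next(iter(aln.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in aln.items()}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_pair(aln: Dict[str, str]) -> FrozenSet[str]:
    """Toy stand-in for tree inference: the pair of taxa with the smallest distance."""
    names = sorted(aln)
    pairs = [
        (hamming(aln[a], aln[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    _, a, b = min(pairs)  # ties are broken by taxon name order
    return frozenset((a, b))


def bootstrap_support(aln: Dict[str, str], pair: Iterable[str],
                      replicates: int = 1000, seed: int = 0) -> float:
    """Proportion of pseudo-replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = frozenset(pair)
    hits = sum(
        closest_pair(bootstrap_alignment(aln, rng)) == target
        for _ in range(replicates)
    )
    return hits / replicates


# Toy data (invented for illustration): A and B are identical, C and D differ everywhere.
toy = {"A": "AAAA", "B": "AAAA", "C": "CCCC", "D": "GGGG"}
support_ab = bootstrap_support(toy, ("A", "B"), replicates=100)
```

With real data, each pseudo-replicate would be passed to the same tree inference method used for the original alignment, and the support of a node would be the proportion of replicate trees containing that node.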
More specifically, the following is edited:

'[...] fast and accurate [111]. PAUP [99,103] is slower than other programs, and uses nucleotide data only.'

Row 343-349: 'The most recent method for phylogenetic reconstruction uses Bayesian inference, which calculates the probability of the molecular evolution model given the data. The main software used for BI-based phylogenetics is MrBayes [112], which uses the Markov Chain Monte Carlo (MCMC) algorithm (Table 7). PhyloBayes [113,114] is a Bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model, well adapted for large datasets and phylogenomics. Bali-Phy [47] can also be used for phylogenetic analysis using Bayesian inference.'

10. You really need to ensure that software and methods referred to are cited properly, preferentially every time and with version numbers included.
Yes, we have now been more careful about this. See response above (comments 8 and 10). In essence, we have edited the text to cite the historical papers describing the different distance methods, as well as all software and methods.
11. Your figures are unclear, and the colours used are not consistent with a colour palette fit for colourblind people.